SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules
Author
Abstract
Simplified Molecular Input Line Entry System (SMILES) is a single-line text representation of a molecule. One molecule can, however, have multiple valid SMILES strings, which is why canonical SMILES were defined to ensure a one-to-one correspondence between SMILES string and molecule. Here, the fact that multiple SMILES strings represent the same molecule is explored as a technique for data augmentation of a molecular QSAR dataset modeled by a long short-term memory (LSTM) cell based neural network. The augmented dataset was 130 times bigger than the original. The network trained on the augmented dataset performs better on a test set than a model built with only one canonical SMILES string per molecule. The correlation coefficient R on the test set improved from 0.56 to 0.66 when using SMILES enumeration, and the root mean square error (RMS) likewise fell from 0.62 to 0.55. The technique also works in the prediction phase: taking the per-molecule average of the predictions for the enumerated SMILES gave a further improvement, to a correlation coefficient of 0.68 and an RMS of 0.52.
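The two ideas in the abstract, writing several SMILES strings for one molecule and averaging the model's predictions over them, can be illustrated with a small sketch. This is not the paper's implementation: the writer below is a toy that only handles acyclic molecules with single bonds and single-letter atoms, and the names `toy_smiles`, `enumerate_smiles` and `averaged_prediction` are hypothetical. A real workflow would presumably rely on a cheminformatics toolkit to generate the strings and on a trained network in place of the `predict` callable.

```python
import random
from statistics import mean


def toy_smiles(atoms, bonds, start, rng):
    """Write one SMILES string for an acyclic, single-bonded molecule.

    Toy assumption: no rings, no charges, no bond orders, no stereochemistry.
    `atoms` is a list of element symbols, `bonds` a list of index pairs.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def walk(atom, parent):
        children = [n for n in neighbors[atom] if n != parent]
        rng.shuffle(children)  # a random branch order gives a different string
        out = atoms[atom]
        # Every child except the last is written as a parenthesised branch.
        for child in children[:-1]:
            out += "(" + walk(child, atom) + ")"
        if children:
            out += walk(children[-1], atom)
        return out

    return walk(start, None)


def enumerate_smiles(atoms, bonds, n_tries=50, seed=0):
    """Collect the distinct strings found by random start atoms and branch orders."""
    rng = random.Random(seed)
    found = set()
    for _ in range(n_tries):
        start = rng.randrange(len(atoms))
        found.add(toy_smiles(atoms, bonds, start, rng))
    return sorted(found)


def averaged_prediction(predict, smiles_variants):
    """Average a model's predictions over all SMILES variants of one molecule."""
    return mean(predict(s) for s in smiles_variants)


# Ethanol as a graph: C-C-O
ethanol_atoms = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]
variants = enumerate_smiles(ethanol_atoms, ethanol_bonds)

# Stand-in for a trained model: here simply the string length (an assumption).
print(variants, averaged_prediction(len, variants))
```

For ethanol the toy writer can only produce strings from the set `CCO`, `OCC`, `C(C)O` and `C(O)C`, all of which describe the same molecule. In training, each variant would be paired with the same target value; at prediction time, the per-molecule mean replaces the single prediction made from the canonical string.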
Similar resources
Yarn tenacity modeling using artificial neural networks and development of a decision support system based on genetic algorithms
Yarn tenacity is one of the most important properties in yarn production. This paper addresses modeling of yarn tenacity as well as optimally determining the amounts of the effective inputs to produce yarn with desired tenacity. The artificial neural network is used as a suitable structure for tenacity modeling of cotton yarn with 30 Ne. As the first step for modeling, the empirical data is col...
Estimating and modeling monthly mean daily global solar radiation on horizontal surfaces using artificial neural networks
In this study, an artificial neural network based model for prediction of solar energy potential in Kerman province in Iran has been developed. Meteorological data of 12 cities for a period of 17 years (1997–2013) and solar radiation for five cities around and inside Kerman province from the Iranian Meteorological Office data center were used for training and testing the network. Meteorologic...
Artificial Neural Network Modeling for Predicting of some Ion Concentrations in the Karaj River
The water quality of the Karaj River was studied through collecting 2137 experimental data set gained by 20 sampling stations. The data included different parameters such as T (temperature), pH, NTU (turbidity), hardness, TDS (total dissolved solids), EC (electrical conductivity) and basic anion, cation concentrations. In this study a multi-layer perceptron artificial neural network model was d...
Modeling and Simulation of Water Softening by Nanofiltration Using Artificial Neural Network
An artificial neural network has been used to determine the volume flux and rejections of Ca2+, Na+ and Cl¯ as a function of transmembrane pressure and concentrations of Ca2+, polyethyleneimine, and polyacrylic acid in water softening by a nanofiltration process in the presence of polyelectrolytes. The feed-forward multi-layer perceptron artificial neural network including an eight-neuron hidde...
Prediction of the Liquid Vapor Pressure Using the Artificial Neural Network-Group Contribution Method
In this paper, vapor pressure for pure compounds is estimated using Artificial Neural Networks and a simple Group Contribution Method (ANN–GCM). For model comprehensiveness, materials were chosen from various families. Most of the materials are from 12 families. Vapor pressure data of 100 compounds is used to train, validate and test the ANN-GCM model. Va...
Journal title:
- CoRR
Volume abs/1703.07076
Publication date 2017